The stock prices of companies listed on a global exchange are influenced by a variety of factors, with the company's financial performance, innovations and collaborations, and market sentiment playing a significant role. In the highly competitive financial industry, news and media reports can rapidly affect investor perceptions and, consequently, stock prices. Given the sheer volume of news and opinions from a wide variety of sources, investors and financial analysts often struggle to stay updated and to accurately interpret their impact on the market. As a result, investment firms need sophisticated tools to analyze market sentiment and integrate this information into their investment strategies.
With an ever-rising number of news articles and opinions, an investment startup aims to leverage artificial intelligence to address the challenge of interpreting stock-related news and its impact on stock prices. They have collected historical daily news for a specific company listed on NASDAQ, along with data on its daily stock price and trade volumes.
As a member of the Data Science and AI team in the startup, you have been tasked with analyzing the data, developing an AI-driven sentiment analysis system that will automatically process and analyze news articles to gauge market sentiment, and summarizing the news at a weekly level to enhance the accuracy of their stock price predictions and optimize investment strategies. This will empower their financial analysts with actionable insights, leading to more informed investment decisions and improved client outcomes.
Date: The date the news was released
News: The content of news articles that could potentially affect the company's stock price
Open: The stock price (in \$) at the beginning of the day
High: The highest stock price (in \$) reached during the day
Low: The lowest stock price (in \$) reached during the day
Close: The adjusted stock price (in \$) at the end of the day
Volume: The number of shares traded during the day
Label: The sentiment polarity of the news content
Note: If the free-tier GPU of Google Colab is not accessible (due to unavailability, exhaustion of the daily limit, or other reasons), the following steps can be taken:
Wait for 12-24 hours until the GPU is accessible again or the daily usage limits are reset.
Switch to a different Google account and resume working on the project from there.
Try using the CPU runtime:
# included this line to eliminate the gensim and numpy dependency and version issues
!pip install --upgrade pip -q
# installing the sentence-transformers and gensim libraries for word embeddings
!pip install -U sentence-transformers gensim transformers tqdm -q
Note: After running the two cells above, restart the session and re-run them to clear any lingering NumPy dependency errors.
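To confirm the restart picked up consistent library versions, a quick sanity check like the following can help (a minimal sketch; it only reports versions for the packages installed above, skipping any that are missing):

```python
# Report installed versions of the key libraries to verify the runtime is consistent
import importlib.metadata as md

for pkg in ["numpy", "gensim", "sentence-transformers", "transformers"]:
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```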
# to read and manipulate the data
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None) # display the full width of each column
# to visualise data
import matplotlib.pyplot as plt
import seaborn as sns
# to use regular expressions for manipulating text data
import re
# to load the natural language toolkit
import nltk
nltk.download('stopwords') # loading the stopwords
nltk.download('wordnet') # loading the WordNet corpus used for lemmatization
# to remove common stop words
from nltk.corpus import stopwords
# to perform stemming
from nltk.stem.porter import PorterStemmer
# To encode the target variable
from sklearn.preprocessing import LabelEncoder
# Patching scipy.linalg before importing gensim
import scipy.linalg # Import scipy.linalg
from numpy import triu # Import triu from numpy
scipy.linalg.triu = triu # Inject triu into scipy.linalg
# To import Word2Vec
from gensim.models import Word2Vec
import sklearn.metrics as metrics
# To tune the model
from sklearn.model_selection import GridSearchCV
# Converting the Stanford GloVe model vector format to word2vec
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
# Deep Learning library
import torch
# to load transformer models
from sentence_transformers import SentenceTransformer
# To split data into train and test sets
from sklearn.model_selection import train_test_split
# To build a Random Forest model
from sklearn.ensemble import RandomForestClassifier
# To compute metrics to evaluate the model
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score, classification_report
# mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
# loading the dataset
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/NLP/Project/stock_news.csv')
data = df.copy()
data.head()
# checking a stock news
data.loc[3, 'News']
data.shape
Observations: The dataset has 349 rows and 8 columns.
data.info()
Observations:
# changing the data type of Date column
data['Date'] = pd.to_datetime(data['Date'])
#verify Date column after converting to datetime type
data.info()
Observations:
data.describe().T
Observations:
#check for null values
data.isnull().sum()
Observations:
#check for duplicates
data.duplicated().sum()
Observations:
Label
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        legend=False,
        hue=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x position for the annotation
        y = p.get_height()  # y position for the annotation

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage

    plt.show()  # show the plot
Distribution of Sentiments
labeled_barplot(data, "Label", perc=True)
Observations:
Class 1 (Positive) accounts for 22.9% of the data, making it the class with the fewest observations.
We see that the data is imbalanced.
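The imbalance can also be quantified directly with normalized value counts. A minimal sketch (using a hypothetical stand-in for `data['Label']` with made-up class counts that mirror the proportions seen above; in the notebook, call `value_counts` on the real column):

```python
import pandas as pd

# Hypothetical stand-in for data['Label']: 0 = neutral, -1 = negative, 1 = positive
labels = pd.Series([0] * 180 + [-1] * 89 + [1] * 80)

# Normalized class frequencies show the imbalance at a glance
proportions = labels.value_counts(normalize=True).round(3)
print(proportions)
```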
Open Price
sns.histplot(data['Open'],kde=True)
plt.show()
Observations:
#calculate the mode
data['Open'].mode()
Observations:
# Creating a time series DataFrame with 'Date' as index
timeseries_df = data.set_index('Date')
timeseries_df['Month'] = timeseries_df.index.month # Extract month from the index
sns.boxplot(x='Month', y='Open', data=timeseries_df) # Box plot of 'Open' prices by month
plt.show()
Observations:
High Price
sns.histplot(data['High'],kde=True)
plt.show()
Observations:
#calculate the mode of High stock price
data['High'].mode()
Observations:
timeseries_df['Month'] = timeseries_df.index.month # Extract month from the index
sns.boxplot(x='Month', y='High', data=timeseries_df) # Box plot of 'High' prices by month
plt.show()
Observations:
Low Price
sns.histplot(data['Low'],kde=True)
plt.show()
Observations:
The Low stock price has a right-skewed distribution.
#calculate the mode of Low stock price
data['Low'].mode()
Observations:
timeseries_df['Month'] = timeseries_df.index.month # Extract month from the index
sns.boxplot(x='Month', y='Low', data=timeseries_df) # Box plot of 'Low' prices by month
plt.show()
Observations:
Close Price
sns.histplot(data['Close'],kde=True)
plt.show()
Observations:
#calculate the mode
data['Close'].mode()
Observations:
timeseries_df['Month'] = timeseries_df.index.month # Extract month from the index
sns.boxplot(x='Month', y='Close', data=timeseries_df) # Box plot of 'Close' prices by month
plt.show()
Observations:
Volume
sns.histplot(data['Volume'],kde=True)
plt.show()
Observations:
sns.boxplot(x=data['Volume'])
plt.show()
Observations:
Compute and check the distribution of the length of news content
#calculate the length of each news article
data['news_length'] = data['News'].apply(len)
#check the newly added column 'news_length'
data.head()
Check the distribution
Histogram
sns.histplot(data['news_length'], kde=True)
plt.xlabel('news length')
plt.ylabel('Frequency')
plt.title('Distribution of news length')
plt.show()
Observations:
print(data['news_length'].describe())
Observations:
sns.boxplot(x=data['news_length'])
plt.xlabel('news length')
plt.title('Box Plot of news length')
plt.show()
Observations:
Note: The above points are listed to provide guidance on how to approach bivariate analysis. Analysis beyond the listed points is needed to score the maximum marks.
sns.pairplot(data=data,hue='Label')
Observations:
plt.figure(figsize=(15, 7))
numeric_df = data.select_dtypes(include=['number'])
sns.heatmap(numeric_df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Observations:
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 3, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of predictor for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of predictor for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[0, 2].set_title("Distribution of predictor for target=" + str(target_uniq[2]))
    sns.histplot(
        data=data[data[target] == target_uniq[2]],
        x=predictor,
        kde=True,
        ax=axs[0, 2],
        color="blue",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, hue=target, ax=axs[1, 0], palette="gist_rainbow", legend=False)

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        hue=target,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
        legend=False,
    )

    plt.tight_layout()
    plt.show()
Sentiment Polarity (Label) vs Open Price
distribution_plot_wrt_target(data, 'Open', 'Label')
Observations:
Sentiment Polarity (Label) vs High Price
distribution_plot_wrt_target(data, 'High', 'Label')
Observations:
Sentiment Polarity (Label) vs Low Price
distribution_plot_wrt_target(data, 'Low', 'Label')
Observations:
Sentiment Polarity (Label) vs Close Price
distribution_plot_wrt_target(data, 'Close', 'Label')
Observations:
Sentiment Polarity (Label) vs Volume
distribution_plot_wrt_target(data, 'Volume', 'Label')
Observations:
Price vs Date
Plot Open, High, Low, and Close Prices vs Date
#build timeseries plot
# Create subplots
fig,axes = plt.subplots(4, 1, figsize=(10, 12), sharex=True) # 4 rows, 1 column
# Plot each price on a separate subplot
sns.lineplot(x=timeseries_df.index, y=timeseries_df['Open'], label='Open', ax=axes[0])
sns.lineplot(x=timeseries_df.index, y=timeseries_df['High'], label='High', ax=axes[1])
sns.lineplot(x=timeseries_df.index, y=timeseries_df['Low'], label='Low', ax=axes[2])
sns.lineplot(x=timeseries_df.index, y=timeseries_df['Close'], label='Close', ax=axes[3])
# Add labels and titles
for ax, price_type in zip(axes, ['Open', 'High', 'Low', 'Close']):
ax.set_ylabel(price_type + ' Price')
ax.legend()
ax.grid(True)
axes[3].set_xlabel('Date') # X-axis label only on the bottom subplot
fig.suptitle('Time Series Plots of Stock Prices', fontsize=16) # Overall title
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels
plt.tight_layout() # Adjust spacing
plt.show() # Display the plot
Observations:
Volume vs Date
#plot Volume vs date
sns.lineplot(x='Date', y='Volume', data=timeseries_df)
plt.xlabel('Date')
plt.ylabel('Volume')
plt.title('Time Series Plot of Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()
Observations:
Volume of stocks traded every month
#Volume of stocks traded every month
timeseries_df['Month'] = timeseries_df.index.month
monthly_volume = timeseries_df.groupby('Month')['Volume'].sum()
sns.lineplot(x= monthly_volume.index, y=monthly_volume.values)
plt.xlabel('Month')
plt.ylabel('Total Volume')
plt.title('Monthly Volume of Stocks Traded')
plt.show()
Observations:
Plot price and volume on the y-axis against date on the x-axis to see how both change over time and whether a pattern emerges.
Open Price, Volume vs date
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(timeseries_df.index, timeseries_df['Open'], color='blue', label='Open Price')
ax1.set_xlabel('Date')
ax1.set_ylabel('Open Price', color='blue')
ax1.tick_params('y', labelcolor='blue')
ax2 = ax1.twinx() # Create a secondary y-axis
ax2.plot(timeseries_df.index, timeseries_df['Volume'], color='red', label='Volume')
ax2.set_ylabel('Volume', color='red')
ax2.tick_params('y', labelcolor='red')
fig.tight_layout()
plt.title('Time Series Plot of Open Price and Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()
Observations:
High Price, Volume vs date
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(timeseries_df.index, timeseries_df['High'], color='blue', label='High Price')
ax1.set_xlabel('Date')
ax1.set_ylabel('High Price', color='blue')
ax1.tick_params('y', labelcolor='blue')
ax2 = ax1.twinx() # Create a secondary y-axis
ax2.plot(timeseries_df.index, timeseries_df['Volume'], color='red', label='Volume')
ax2.set_ylabel('Volume', color='red')
ax2.tick_params('y', labelcolor='red')
fig.tight_layout()
plt.title('Time Series Plot of High Price and Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()
Observations:
Low Price, Volume vs date
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(timeseries_df.index, timeseries_df['Low'], color='blue', label='Low Price')
ax1.set_xlabel('Date')
ax1.set_ylabel('Low Price', color='blue')
ax1.tick_params('y', labelcolor='blue')
ax2 = ax1.twinx() # Create a secondary y-axis
ax2.plot(timeseries_df.index, timeseries_df['Volume'], color='red', label='Volume')
ax2.set_ylabel('Volume', color='red')
ax2.tick_params('y', labelcolor='red')
fig.tight_layout()
plt.title('Time Series Plot of Low Price and Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()
Observations:
Close Price, Volume vs date
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(timeseries_df.index, timeseries_df['Close'], color='blue', label='Close Price')
ax1.set_xlabel('Date')
ax1.set_ylabel('Close Price', color='blue')
ax1.tick_params('y', labelcolor='blue')
ax2 = ax1.twinx() # Create a secondary y-axis
ax2.plot(timeseries_df.index, timeseries_df['Volume'], color='red', label='Volume')
ax2.set_ylabel('Volume', color='red')
ax2.tick_params('y', labelcolor='red')
fig.tight_layout()
plt.title('Time Series Plot of Close Price and Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()
Observations:
Univariate Analysis
Multivariate Analysis
Sentiment Polarity (Label) vs Open Price
Sentiment Polarity (Label) vs High Price
Sentiment Polarity (Label) vs Low Price
Sentiment Polarity (Label) vs Close Price
Sentiment Polarity (Label) vs Volume
Price vs Date
Volume of stocks traded every month
dataset = data.copy()
# Loading the Porter Stemmer
ps = PorterStemmer()
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Split text into separate words
    words = text.split()
    # Remove English stopwords (keep the filtered list so stemming works on it)
    words = [word for word in words if word not in stopwords.words('english')]
    # Apply the Porter Stemmer to every remaining word and join the stemmed words back into a single string
    text = ' '.join([ps.stem(word) for word in words])
    return text
# preprocessing the textual column
dataset['News_clean'] = dataset['News'].apply(preprocess_text)
#display cleaned text
dataset.head()
Please note: The train/validation/test split has been moved to after the word embeddings are created.
# Creating a list of all words in our data
words_list = [item.split(" ") for item in dataset['News_clean'].values]
# Creating an instance of Word2Vec
vec_size = 300
model_W2V = Word2Vec(words_list, vector_size = vec_size, min_count = 1, window=5, workers = 6)
# Checking the size of the vocabulary
print("Length of the vocabulary is", len(list(model_W2V.wv.key_to_index)))
Let's check out a few word embeddings obtained using the model
# Checking the word embedding of a random word
word = "market"
model_W2V.wv[word]
# Checking the word embedding of a random word
word = "stock"
model_W2V.wv[word]
# Checking the word embedding of a random word
word = "analyst"
model_W2V.wv[word]
# Retrieving the words present in the Word2Vec model's vocabulary
words = list(model_W2V.wv.key_to_index.keys())
# Retrieving word vectors for all the words present in the model's vocabulary
wvs = model_W2V.wv[words].tolist()
# Creating a dictionary of words and their corresponding vectors
word_vector_dict = dict(zip(words, wvs))
def average_vectorizer_Word2Vec(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")
    # Creating a list of words in the sentence that are present in the model vocabulary
    # (dictionary membership test is equivalent to checking the vocabulary list, but faster)
    words_in_vocab = [word for word in doc.split() if word in word_vector_dict]
    # Adding the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(word_vector_dict[word])
    # Dividing by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)
    return feature_vector
# creating a dataframe of the vectorized documents
df_Word2Vec = pd.DataFrame(dataset['News_clean'].apply(average_vectorizer_Word2Vec).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
df_Word2Vec
from gensim.models import KeyedVectors
# load the Stanford GloVe model
filename = '/content/drive/MyDrive/Colab Notebooks/NLP/Project/glove.6B.100d.txt.word2vec'
glove_model = KeyedVectors.load_word2vec_format(filename, binary=False)
# Checking the size of the vocabulary
print("Length of the vocabulary is", len(glove_model.index_to_key))
# Checking the word embedding of a random word
word = "market"
glove_model[word]
# Checking the word embedding of a random word
word = "stock"
glove_model[word]
# Checking the word embedding of a random word
word = "analyst"
glove_model[word]
# Retrieving the words present in the GloVe model's vocabulary
glove_words = glove_model.index_to_key
# Creating a dictionary of words and their corresponding vectors
glove_word_vector_dict = dict(zip(glove_model.index_to_key,list(glove_model.vectors)))
vec_size=100
def average_vectorizer_GloVe(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")
    # Creating a list of words in the sentence that are present in the model vocabulary
    # (dictionary membership test is equivalent to checking the vocabulary list, but faster)
    words_in_vocab = [word for word in doc.split() if word in glove_word_vector_dict]
    # Adding the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(glove_word_vector_dict[word])
    # Dividing by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)
    return feature_vector
# creating a dataframe of the vectorized documents
df_Glove = pd.DataFrame(dataset['News_clean'].apply(average_vectorizer_GloVe).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
df_Glove
# defining the model
sent_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# setting the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# encoding the dataset
embedding_matrix = sent_model.encode(dataset['News'], device=device, show_progress_bar=True)
# printing the shape of the embedding matrix
embedding_matrix.shape
# printing the embedding vector of the first news in the dataset
embedding_matrix[0,:]
Splitting the dataset
# Creating dependent and independent variables
X_word2vec = df_Word2Vec.copy()
X_glove = df_Glove.copy()
X_sent_transformer = embedding_matrix.copy()
y=dataset['Label']
def split(X, y):
    # Initial split into a training set (80%) and a temporary set (20%)
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)
    # Further split the temporary set into validation (10%) and test (10%) sets
    X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)
    return X_train, X_valid, X_test, y_train, y_valid, y_test
#Splitting the dataset.
X_train_word2vec,X_valid_word2vec,X_test_word2vec,y_train_word2vec,y_valid_word2vec,y_test_word2vec=split(X_word2vec,y)
X_train_glove,X_valid_glove,X_test_glove,y_train_glove,y_valid_glove,y_test_glove=split(X_glove,y)
X_train_sent_transformer,X_valid_sent_transformer,X_test_sent_transformer,y_train_sent_transformer,y_valid_sent_transformer,y_test_sent_transformer=split(X_sent_transformer,y)
Check the shapes of the training, validation, and test datasets for the Word2Vec, GloVe, and Sentence Transformer embeddings
print(X_train_word2vec.shape, X_test_word2vec.shape, X_valid_word2vec.shape)
print(y_train_word2vec.shape, y_test_word2vec.shape, y_valid_word2vec.shape)
print(X_train_glove.shape, X_test_glove.shape, X_valid_glove.shape)
print(y_train_glove.shape, y_test_glove.shape, y_valid_glove.shape)
print(X_train_sent_transformer.shape, X_test_sent_transformer.shape, X_valid_sent_transformer.shape)
print(y_train_sent_transformer.shape, y_test_sent_transformer.shape, y_valid_sent_transformer.shape)
Observations:
Model Evaluation Criterion
The model can make wrong predictions:
Which case is more important?
How to reduce this loss?
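Since the tuning later in this notebook optimizes weighted recall, here is a minimal illustration (with a toy three-class example, not the project data) of how scikit-learn computes per-class recall and then averages it weighted by class support:

```python
from sklearn.metrics import recall_score

# Toy three-class example: -1 = negative, 0 = neutral, 1 = positive
y_true = [-1, -1, 0, 0, 0, 1, 1, 1]
y_pred = [-1,  0, 0, 0, 1, 1, 1, 0]

# Per-class recall, then the support-weighted average used later as the tuning scorer
per_class = recall_score(y_true, y_pred, average=None, labels=[-1, 0, 1])
weighted = recall_score(y_true, y_pred, average="weighted")
print(per_class, weighted)  # per-class: [0.5, 2/3, 2/3]; weighted: 0.625
```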
Function for confusion matrix
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(cm.shape[0], cm.shape[1])

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Building the model
rf_word2vec = RandomForestClassifier(random_state = 42)
# Fitting on train data
rf_word2vec.fit(X_train_word2vec, y_train_word2vec)
Confusion Matrix
confusion_matrix_sklearn(rf_word2vec, X_train_word2vec, y_train_word2vec)
Observations:
confusion_matrix_sklearn(rf_word2vec, X_valid_word2vec, y_valid_word2vec)
Observations:
# Predicting on train data
y_pred_train_word2vec = rf_word2vec.predict(X_train_word2vec)
# Predicting on validation data
y_pred_valid_word2vec = rf_word2vec.predict(X_valid_word2vec)
Classification Report
print(classification_report(y_train_word2vec, y_pred_train_word2vec))
Observations:
default_word2vec_report = classification_report(y_valid_word2vec, y_pred_valid_word2vec)
print(default_word2vec_report)
Observations:
# Building the model
rf_glove = RandomForestClassifier(random_state = 42)
# Fitting on train data
rf_glove.fit(X_train_glove, y_train_glove)
Confusion Matrix
confusion_matrix_sklearn(rf_glove, X_train_glove, y_train_glove)
Observations:
confusion_matrix_sklearn(rf_glove, X_valid_glove, y_valid_glove)
Observations:
# Predicting on train data
y_pred_train_glove = rf_glove.predict(X_train_glove)
# Predicting on validation data
y_pred_valid_glove = rf_glove.predict(X_valid_glove)
Classification Report
print(classification_report(y_train_glove, y_pred_train_glove))
Observations:
default_glove_report = classification_report(y_valid_glove, y_pred_valid_glove)
print(default_glove_report)
Observations:
rf_sent_transformer = RandomForestClassifier(random_state = 42)
# Fitting on train data
rf_sent_transformer.fit(X_train_sent_transformer, y_train_sent_transformer)
Confusion Matrix
confusion_matrix_sklearn(rf_sent_transformer, X_train_sent_transformer, y_train_sent_transformer)
Observations:
confusion_matrix_sklearn(rf_sent_transformer, X_valid_sent_transformer, y_valid_sent_transformer)
Observations:
# Predicting on train data
y_pred_train_sent_transformer = rf_sent_transformer.predict(X_train_sent_transformer)
# Predicting on validation data
y_pred_valid_sent_transformer = rf_sent_transformer.predict(X_valid_sent_transformer)
Classification Report
print(classification_report(y_train_sent_transformer, y_pred_train_sent_transformer))
Observations:
#included zero_division=1 to address the warning related to precision values
default_sent_report = classification_report(y_valid_sent_transformer, y_pred_valid_sent_transformer, zero_division=1)
print(default_sent_report)
Observations:
We'll try to address the class imbalance problem now with Class weights.
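`class_weight="balanced"` reweights each class inversely to its frequency, i.e. `n_samples / (n_classes * class_count)`. A minimal sketch of the weights this produces (using hypothetical class counts that mirror the imbalance seen in the EDA, not values computed from the project data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label distribution: 89 negative, 180 neutral, 80 positive
y = np.array([-1] * 89 + [0] * 180 + [1] * 80)

# "balanced" weight for each class = n_samples / (n_classes * class_count),
# so the majority class gets a weight below 1 and minority classes above 1
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, np.round(weights, 3))))
```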
rf_word2vec_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_word2vec_balanced.fit(X_train_word2vec, y_train_word2vec)
Confusion Matrix
confusion_matrix_sklearn(rf_word2vec_balanced, X_train_word2vec, y_train_word2vec)
Observations:
confusion_matrix_sklearn(rf_word2vec_balanced, X_valid_word2vec, y_valid_word2vec)
Observations:
# Predicting on train data
y_pred_train_word2vec_balanced = rf_word2vec_balanced.predict(X_train_word2vec)
# Predicting on validation data
y_pred_valid_word2vec_balanced = rf_word2vec_balanced.predict(X_valid_word2vec)
Classification report
print(classification_report(y_train_word2vec, y_pred_train_word2vec_balanced))
Observations:
weighted_word2vec_report = classification_report(y_valid_word2vec, y_pred_valid_word2vec_balanced)
print(weighted_word2vec_report)
Observations:
rf_glove_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_glove_balanced.fit(X_train_glove, y_train_glove)
Confusion Matrix
confusion_matrix_sklearn(rf_glove_balanced, X_train_glove, y_train_glove)
Observations:
confusion_matrix_sklearn(rf_glove_balanced, X_valid_glove, y_valid_glove)
Observations:
# Predicting on train data
y_pred_train_glove_balanced = rf_glove_balanced.predict(X_train_glove)
# Predicting on validation data
y_pred_valid_glove_balanced = rf_glove_balanced.predict(X_valid_glove)
Classification report
print(classification_report(y_train_glove, y_pred_train_glove_balanced))
Observations:
weighted_glove_report = classification_report(y_valid_glove, y_pred_valid_glove_balanced, zero_division=1)
print(weighted_glove_report)
Observations:
rf_sent_transformer_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_sent_transformer_balanced.fit(X_train_sent_transformer, y_train_sent_transformer)
Confusion Matrix
confusion_matrix_sklearn(rf_sent_transformer_balanced, X_train_sent_transformer, y_train_sent_transformer)
Observations:
confusion_matrix_sklearn(rf_sent_transformer_balanced, X_valid_sent_transformer, y_valid_sent_transformer)
Observations:
#predicting on train data
y_pred_train_sent_transformer_balanced = rf_sent_transformer_balanced.predict(X_train_sent_transformer)
# predicting on validation data
y_pred_valid_sent_transformer_balanced = rf_sent_transformer_balanced.predict(X_valid_sent_transformer)
Classification Report
print(classification_report(y_train_sent_transformer, y_pred_train_sent_transformer_balanced))
Observations:
weighted_sent_report = classification_report(y_valid_sent_transformer, y_pred_valid_sent_transformer_balanced, zero_division=1)
print(weighted_sent_report)
Observations:
import sklearn.metrics as metrics
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier
rf_tuned = RandomForestClassifier(class_weight="balanced", random_state=42)
# defining the hyperparameter grid for tuning
parameters = {
"max_depth": list(np.arange(4, 15, 2)),
"max_features": ["sqrt", 0.5, 0.7],
"min_samples_split": [5, 6, 7],
"n_estimators": np.arange(30, 110, 10),
}
# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')
# running the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_word2vec, y_train_word2vec)
# Creating a new model with the best combination of parameters
rf_word2vec_tuned = grid_obj.best_estimator_
# Fit the new model to the data
rf_word2vec_tuned.fit(X_train_word2vec, y_train_word2vec)
Confusion Matrix
confusion_matrix_sklearn(rf_word2vec_tuned, X_train_word2vec, y_train_word2vec)
Observations:
confusion_matrix_sklearn(rf_word2vec_tuned, X_valid_word2vec, y_valid_word2vec)
Observations:
# Predicting on train data
y_pred_train_word2vec_tuned = rf_word2vec_tuned.predict(X_train_word2vec)
# Predicting on validation data
y_pred_valid_word2vec_tuned = rf_word2vec_tuned.predict(X_valid_word2vec)
Classification Report
print(classification_report(y_train_word2vec, y_pred_train_word2vec_tuned))
Observations:
tuned_word2vec_report = classification_report(y_valid_word2vec, y_pred_valid_word2vec_tuned)
print(tuned_word2vec_report)
Observations:
# Choose the type of classifier
rf_tuned = RandomForestClassifier(class_weight="balanced", random_state=42)
# defining the hyperparameter grid for tuning
parameters = {
"max_depth": list(np.arange(4, 15, 2)),
"max_features": ["sqrt", 0.5, 0.7],
"min_samples_split": [5, 6, 7],
"n_estimators": np.arange(30, 110, 10),
}
# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')
# running the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_glove, y_train_glove)
# Creating a new model with the best combination of parameters
rf_glove_tuned = grid_obj.best_estimator_
# Fit the new model to the data
rf_glove_tuned.fit(X_train_glove, y_train_glove)
Confusion Matrix
#Printing the confusion matrix
confusion_matrix_sklearn(rf_glove_tuned, X_train_glove, y_train_glove)
Observations:
#Printing the confusion matrix
confusion_matrix_sklearn(rf_glove_tuned, X_valid_glove, y_valid_glove)
Observations:
# Predicting on train data
y_pred_train_glove_tuned = rf_glove_tuned.predict(X_train_glove)
# Predicting on validation data
y_pred_valid_glove_tuned = rf_glove_tuned.predict(X_valid_glove)
Classification Report
print(classification_report(y_train_glove, y_pred_train_glove_tuned))
Observations:
tuned_glove_report = classification_report(y_valid_glove, y_pred_valid_glove_tuned)
print(tuned_glove_report)
Observations:
# Choose the type of classifier
rf_tuned = RandomForestClassifier(class_weight="balanced", random_state=42)
# defining the hyperparameter grid for tuning
parameters = {
"max_depth": list(np.arange(4, 15, 2)),
"max_features": ["sqrt", 0.5, 0.7],
"min_samples_split": [5, 6, 7],
"n_estimators": np.arange(30, 110, 10),
}
# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')
# running the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_sent_transformer, y_train_sent_transformer)
# Creating a new model with the best combination of parameters
rf_sent_tuned = grid_obj.best_estimator_
# Fit the new model to the data
rf_sent_tuned.fit(X_train_sent_transformer, y_train_sent_transformer)
#Printing the confusion matrix
confusion_matrix_sklearn(rf_sent_tuned, X_train_sent_transformer, y_train_sent_transformer)
Observations:
#Printing the confusion matrix
confusion_matrix_sklearn(rf_sent_tuned, X_valid_sent_transformer, y_valid_sent_transformer)
Observations:
# Predicting on train data
y_pred_train_sent_tuned = rf_sent_tuned.predict(X_train_sent_transformer)
# Predicting on validation data
y_pred_valid_sent_tuned = rf_sent_tuned.predict(X_valid_sent_transformer)
Classification Report
print(classification_report(y_train_sent_transformer, y_pred_train_sent_tuned))
Observations:
tuned_sent_report = classification_report(y_valid_sent_transformer, y_pred_valid_sent_tuned)
print(tuned_sent_report)
Observations:
import pandas as pd
from xgboost import XGBClassifier
# XGBoost doesn't accept negative class labels in y, so map them to non-negative integers
# Assuming y_train_word2vec is a pandas Series
y_train_xgb_word2vec = y_train_word2vec.map({-1: 0, 0: 1, 1: 2})  # -1 -> 0, 0 -> 1, 1 -> 2
#Fitting the model
xgb_word2vec = XGBClassifier(random_state=42, eval_metric='logloss')
xgb_word2vec.fit(X_train_word2vec, y_train_xgb_word2vec)
y_valid_xgb_word2vec = y_valid_word2vec.map({-1: 0, 0: 1, 1: 2})  # -1 -> 0, 0 -> 1, 1 -> 2
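Because the targets are remapped for XGBoost, any predictions it produces are in the 0/1/2 encoding and should be mapped back to the original -1/0/1 sentiment labels before reporting. A small sketch of the inverse mapping (the prediction array here is hypothetical):

```python
import numpy as np
import pandas as pd

# Forward map used for XGBoost: -1 -> 0, 0 -> 1, 1 -> 2
label_to_xgb = {-1: 0, 0: 1, 1: 2}
xgb_to_label = {v: k for k, v in label_to_xgb.items()}

# Hypothetical model output in XGBoost's 0/1/2 encoding
y_pred_encoded = np.array([0, 2, 1, 2, 0])

# Map back to the original sentiment labels for interpretation
y_pred_original = pd.Series(y_pred_encoded).map(xgb_to_label)
print(y_pred_original.tolist())  # [-1, 1, 0, 1, -1]
```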
Confusion Matrix
confusion_matrix_sklearn(xgb_word2vec, X_train_word2vec, y_train_xgb_word2vec)
Observations:
confusion_matrix_sklearn(xgb_word2vec, X_valid_word2vec, y_valid_xgb_word2vec)
Observations:
# Predicting on train data
y_pred_train_xgbword2vec = xgb_word2vec.predict(X_train_word2vec)
# Predicting on validation data
y_pred_valid_xgbword2vec = xgb_word2vec.predict(X_valid_word2vec)
Classification Report
print(classification_report(y_train_xgb_word2vec, y_pred_train_xgbword2vec))
Observations:
default_xgb_word2vec_report = classification_report(y_valid_xgb_word2vec, y_pred_valid_xgbword2vec)
print(default_xgb_word2vec_report)
Observations:
# XGBoost doesn't accept negative class labels in y, so map them to non-negative integers
y_train_xgb_glove = y_train_glove.map({-1: 0, 0: 1, 1: 2})  # -1 -> 0, 0 -> 1, 1 -> 2
y_valid_xgb_glove = y_valid_glove.map({-1: 0, 0: 1, 1: 2})  # -1 -> 0, 0 -> 1, 1 -> 2
# Building the model
xgb_glove = XGBClassifier(random_state = 42)
# Fitting on train data
xgb_glove.fit(X_train_glove, y_train_xgb_glove)
Confusion Matrix
confusion_matrix_sklearn(xgb_glove, X_train_glove, y_train_xgb_glove)
Observations:
confusion_matrix_sklearn(xgb_glove, X_valid_glove, y_valid_xgb_glove)
Observations:
# Predicting on train data
y_pred_train_xgb_glove = xgb_glove.predict(X_train_glove)
# Predicting on validation data
y_pred_valid_xgb_glove = xgb_glove.predict(X_valid_glove)
Classification Report
print(classification_report(y_train_xgb_glove, y_pred_train_xgb_glove))
Observations:
default_xgb_glove_report = classification_report(y_valid_xgb_glove, y_pred_valid_xgb_glove)
print(default_xgb_glove_report)
Observations:
# XGBoost doesn't accept negative class labels in y, so map them to non-negative integers
y_train_xgb_sent_transformer = y_train_sent_transformer.map({-1: 0, 0: 1, 1: 2})  # -1 -> 0, 0 -> 1, 1 -> 2
y_valid_xgb_sent_transformer = y_valid_sent_transformer.map({-1: 0, 0: 1, 1: 2})  # -1 -> 0, 0 -> 1, 1 -> 2
xgb_sent_transformer = XGBClassifier(random_state = 42)
# Fitting on train data
xgb_sent_transformer.fit(X_train_sent_transformer, y_train_xgb_sent_transformer)
Confusion Matrix
confusion_matrix_sklearn(xgb_sent_transformer, X_train_sent_transformer, y_train_xgb_sent_transformer)
Observations:
confusion_matrix_sklearn(xgb_sent_transformer, X_valid_sent_transformer, y_valid_xgb_sent_transformer)
Observations:
# Predicting on train data
y_pred_train_xgb_sent_transformer = xgb_sent_transformer.predict(X_train_sent_transformer)
# Predicting on validation data
y_pred_valid_xgb_sent_transformer = xgb_sent_transformer.predict(X_valid_sent_transformer)
Classification Report
print(classification_report(y_train_xgb_sent_transformer, y_pred_train_xgb_sent_transformer))
Observations:
#included zero_division=1 to address the warning related to precision values
default_xgb_sent_report = classification_report(y_valid_xgb_sent_transformer, y_pred_valid_xgb_sent_transformer, zero_division=1)
print(default_xgb_sent_report)
Observations:
import sklearn.metrics as metrics
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier
xgb_tuned = XGBClassifier(random_state=42, eval_metric='mlogloss')
# defining the hyperparameter grid for tuning
parameters = {
"n_estimators": [10,30,50],
"subsample":[0.7,0.9,1],
"learning_rate":[0.05, 0.1,0.2],
"colsample_bytree":[0.7,0.9,1],
"colsample_bylevel":[0.5,0.7,1]
}
# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')
# running the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_word2vec, y_train_xgb_word2vec)
# Creating a new model with the best combination of parameters
xgb_word2vec_tuned = grid_obj.best_estimator_
# Fit the new model to the data
xgb_word2vec_tuned.fit(X_train_word2vec, y_train_xgb_word2vec)
Confusion Matrix
confusion_matrix_sklearn(xgb_word2vec_tuned, X_train_word2vec, y_train_xgb_word2vec)
Observations:
confusion_matrix_sklearn(xgb_word2vec_tuned, X_valid_word2vec, y_valid_xgb_word2vec)
Observations:
# Predicting on train data
y_pred_train_xgb_word2vec_tuned = xgb_word2vec_tuned.predict(X_train_word2vec)
# Predicting on validation data
y_pred_valid_xgb_word2vec_tuned = xgb_word2vec_tuned.predict(X_valid_word2vec)
Classification Report
print(classification_report(y_train_xgb_word2vec, y_pred_train_xgb_word2vec_tuned))
Observations:
xgb_tuned_word2vec_report = classification_report(y_valid_xgb_word2vec, y_pred_valid_xgb_word2vec_tuned)
print(xgb_tuned_word2vec_report)
Observations:
# Choose the type of classifier
xgb_tuned = XGBClassifier(random_state=42, eval_metric='mlogloss')
# defining the hyperparameter grid for tuning
parameters = {
"n_estimators": [10,30,50],
"subsample":[0.7,0.9,1],
"learning_rate":[0.05, 0.1,0.2],
"colsample_bytree":[0.7,0.9,1],
"colsample_bylevel":[0.5,0.7,1]
}
# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')
# running the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_glove, y_train_xgb_glove)
# Creating a new model with the best combination of parameters
xgb_glove_tuned = grid_obj.best_estimator_
# Fit the new model to the data
xgb_glove_tuned.fit(X_train_glove, y_train_xgb_glove)
Confusion Matrix
#Printing the confusion matrix
confusion_matrix_sklearn(xgb_glove_tuned, X_train_glove, y_train_xgb_glove)
Observations:
#Printing the confusion matrix
confusion_matrix_sklearn(xgb_glove_tuned, X_valid_glove, y_valid_xgb_glove)
Observations:
# Predicting on train data
y_pred_train_xgb_glove_tuned = xgb_glove_tuned.predict(X_train_glove)
# Predicting on validation data
y_pred_valid_xgb_glove_tuned = xgb_glove_tuned.predict(X_valid_glove)
Classification Report
print(classification_report(y_train_xgb_glove, y_pred_train_xgb_glove_tuned))
Observations:
xgb_tuned_glove_report = classification_report(y_valid_xgb_glove, y_pred_valid_xgb_glove_tuned)
print(xgb_tuned_glove_report)
Observations:
# Choose the type of classifier
xgb_tuned = XGBClassifier(random_state=42, eval_metric='mlogloss')
# defining the hyperparameter grid for tuning
parameters = {
"n_estimators": [10,30,50],
"subsample":[0.7,0.9,1],
"learning_rate":[0.05, 0.1,0.2],
"colsample_bytree":[0.7,0.9,1],
"colsample_bylevel":[0.5,0.7,1]
}
# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')
# running the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_sent_transformer, y_train_xgb_sent_transformer)
# Creating a new model with the best combination of parameters
xgb_sent_tuned = grid_obj.best_estimator_
# Fit the new model to the data
xgb_sent_tuned.fit(X_train_sent_transformer, y_train_xgb_sent_transformer)
#Printing the confusion matrix
confusion_matrix_sklearn(xgb_sent_tuned, X_train_sent_transformer, y_train_xgb_sent_transformer)
Observations:
#Printing the confusion matrix
confusion_matrix_sklearn(xgb_sent_tuned, X_valid_sent_transformer, y_valid_xgb_sent_transformer)
Observations:
# Predicting on train data
y_pred_train_xgb_sent_tuned = xgb_sent_tuned.predict(X_train_sent_transformer)
# Predicting on validation data
y_pred_valid_xgb_sent_tuned = xgb_sent_tuned.predict(X_valid_sent_transformer)
Classification Report
print(classification_report(y_train_xgb_sent_transformer, y_pred_train_xgb_sent_tuned))
Observations:
xgb_tuned_sent_report = classification_report(y_valid_xgb_sent_transformer, y_pred_valid_xgb_sent_tuned)
print(xgb_tuned_sent_report)
Observations:
# Summarize all the reports
def summarize_reports(reports, model_type):
"""Summarizes model reports in a structured format.
Args:
reports (dict): A dictionary of reports where keys are report names
and values are the actual report objects.
model_type (str): The type of model (e.g., "Random Forest", "XGBoost").
Returns:
None (prints the summary to the console)
"""
print(f"-----------------{model_type.upper()} MODELS---------------------- ")
for report_name, report in reports.items():
print(f"\n{report_name.replace('_', ' ').title()} Report:")
print(report)
print("-" * 50)
# Define dictionaries to hold your reports
random_forest_reports = {
"default_word2vec": default_word2vec_report,
"weighted_word2vec": weighted_word2vec_report,
"tuned_word2vec": tuned_word2vec_report,
"default_glove": default_glove_report,
"weighted_glove": weighted_glove_report,
"tuned_glove": tuned_glove_report,
"default_sentence_transformer": default_sent_report,
"weighted_sentence_transformer": weighted_sent_report,
"tuned_sentence_transformer": tuned_sent_report,
}
xgboost_reports = {
"default_word2vec": default_xgb_word2vec_report,
"tuned_word2vec": xgb_tuned_word2vec_report,
"default_glove": default_xgb_glove_report,
"tuned_glove": xgb_tuned_glove_report,
"default_sentence_transformer": default_xgb_sent_report,
"tuned_sentence_transformer": xgb_tuned_sent_report,
}
# Print the summarized reports
print("Metrics summary of all the models")
print("-" * 50)
summarize_reports(random_forest_reports, "Random Forest")
summarize_reports(xgboost_reports, "XGBoost")
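The string reports printed above are hard to compare side by side. One option, sketched here on toy labels (the label arrays are stand-ins, not the project's actual predictions), is to request `classification_report` as a dictionary and collect the weighted-average rows into a single comparison table:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy true/predicted labels standing in for one model's validation output
y_true = [-1, 0, 1, 1, 0, -1, 1, 0]
y_pred = [-1, 0, 1, 0, 0, 1, 1, 0]

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# One row per model; repeat for each report to build a comparison table
summary = pd.DataFrame({"toy_model": report["weighted avg"]}).T
print(summary[["precision", "recall", "f1-score"]])
```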
Model Performance Summary:
Four models are tied for the highest weighted recall score of 46%. Let us look at other factors to choose among them.
Class-wise Recall:
The XGBoost - Default Sentence Transformer model has better class-wise recall for the negative and positive sentiment classes than the other models.
Computational Efficiency:
Hyperparameter-tuned models require more computational resources.
Sentence Transformers require no pre-processing or cleaning of the data; the input can be fed in as is, which saves computational resources.
Overall, the XGBoost - Default Sentence Transformer model requires the least computational resources.
Other Metrics (F1-Score):
The XGBoost - Default Sentence Transformer model also has the highest weighted F1-score, at 45%.
Final Model: Based on all the criteria above, the XGBoost - Default Sentence Transformer model is the best choice.
# Predicting on test data
y_pred_test_xgb_sent_transformer = xgb_sent_transformer.predict(X_test_sent_transformer)
Confusion Matrix
# XGBoost doesn't accept negative class labels in y, so map them to non-negative integers
y_test_xgb_sent_transformer = y_test_sent_transformer.map({-1: 0, 0: 1, 1: 2})  # -1 -> 0, 0 -> 1, 1 -> 2
confusion_matrix_sklearn(xgb_sent_transformer, X_test_sent_transformer, y_test_xgb_sent_transformer)
Classification Report
#included zero_division=1 to address the warning related to precision values
final_model_report = classification_report(y_test_xgb_sent_transformer, y_pred_test_xgb_sent_transformer, zero_division=1)
print(final_model_report)
Final Model Summary: (XGBoost - Default Sentence Transformer)
Conclusion: The model has generalized well, delivering performance on the test set similar to that on the validation set.
Important Note: It is recommended to run this section of the project independently from the previous sections in order to avoid runtime crashes due to RAM overload.
!pip install git+https://github.com/abetlen/llama-cpp-python.git
# Installation for GPU llama-cpp-python
# uncomment and run the following code in case GPU is being used
#!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 -q
# Installation for CPU llama-cpp-python
# uncomment and run the following code in case GPU is not being used
#!CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python -q
# Function to download the model from the Hugging Face model hub
from huggingface_hub import hf_hub_download
# Importing the Llama class from the llama_cpp module
from llama_cpp import Llama
# Importing the library for data manipulation
import pandas as pd
from tqdm import tqdm # For progress bar related functionalities
tqdm.pandas()
# to ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
# mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
# loading the dataset
df_summarization = pd.read_csv('/content/drive/My Drive/Colab Notebooks/NLP/Project/stock_news.csv')
data_summarization = df_summarization.copy()
import torch
from llama_cpp import Llama
# Check if CUDA is available
if torch.cuda.is_available():
device = torch.device('cuda')
else:
device = torch.device('cpu')
print(device)
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"
# Using hf_hub_download to download a model from the Hugging Face model hub
# The repo_id parameter specifies the model name or path in the Hugging Face repository
# The filename parameter specifies the name of the file to download
model_path = hf_hub_download(
repo_id=model_name_or_path,
filename=model_basename
)
llm = Llama(
model_path=model_path,
n_threads=2, # CPU cores
n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
n_gpu_layers=43, # Change this value based on your model and your GPU VRAM pool.
n_ctx=5500, # Context window
)
data_summarization["Date"] = pd.to_datetime(data_summarization['Date']) # Convert the 'Date' column to datetime format.
# Group the data by week using the 'Date' column.
weekly_grouped = data_summarization.groupby(pd.Grouper(key='Date', freq='W'))
weekly_grouped = weekly_grouped.agg(
{
'News': lambda x: ' || '.join(x) # Join the news values with ' || ' separator.
}
).reset_index()
print(weekly_grouped.shape)
weekly_grouped
# creating a copy of the data
data_1 = weekly_grouped.copy()
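The weekly grouping above relies on `pd.Grouper(freq='W')`, which buckets dates into weeks ending on Sunday. A small self-contained demo of this behavior on toy data (the dates and news strings are illustrative only):

```python
import pandas as pd

# Toy daily news spanning two calendar weeks
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-08"]),
    "News": ["a", "b", "c"],
})

# freq='W' labels each bucket by its week-ending Sunday, as in the cell above
weekly = (
    toy.groupby(pd.Grouper(key="Date", freq="W"))
       .agg({"News": lambda x: " || ".join(x)})
       .reset_index()
)
print(weekly)
```

Jan 1 and Jan 3, 2024 fall in the week ending Sunday Jan 7, while Jan 8 (a Monday) starts the next bucket, so the toy frame collapses to two weekly rows.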
Note:
The model is expected to summarize the news from the week by identifying the top three positive and negative events that are most likely to impact the price of the stock.
As an output, the model is expected to return a JSON containing two keys, one for Positive Events and one for Negative Events.
For the project, we need to define the prompt to be fed to the LLM to help it understand the task to perform. The following should be the components of the prompt:
Role: Specifies the role the LLM will be taking up to perform the specified task, along with any specific details regarding the role
You are an expert data analyst specializing in news content analysis.
Task: Specifies the task to be performed and outlines what needs to be accomplished, clearly defining the objective
Analyze the provided news headline and return the main topics contained within it.
Instructions: Provides detailed guidelines on how to perform the task, which includes steps, rules, and criteria to ensure the task is executed correctly
Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.
Output Format: Specifies the format in which the final response should be structured, ensuring consistency and clarity in the generated output
Return the output in JSON format with keys as the topic number and values as the actual topic.
Full Prompt Example:
You are an expert data analyst specializing in news content analysis.
Task: Analyze the provided news headline and return the main topics contained within it.
Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.
Return the output in JSON format with keys as the topic number and values as the actual topic.
Sample Output:
{"1": "Politics", "2": "Economy", "3": "Health" }
# defining a function to parse the JSON output from the model
def extract_json_data(json_str):
import json
try:
# Find the indices of the opening and closing curly braces
json_start = json_str.find('{')
json_end = json_str.rfind('}')
if json_start != -1 and json_end != -1:
extracted_category = json_str[json_start:json_end + 1] # Extract the JSON object
data_dict = json.loads(extracted_category)
return data_dict
else:
print(f"Warning: JSON object not found in response: {json_str}")
return {}
except json.JSONDecodeError as e:
print(f"Error parsing JSON: {e}")
return {}
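A quick sanity check of the same find/rfind parsing logic helps confirm it handles typical LLM output, where the JSON is wrapped in explanatory prose. This sketch re-implements the helper inline so it runs standalone (the sample response strings are hypothetical):

```python
import json

def extract_json_data(json_str):
    # Same logic as the helper above: slice from the first '{' to the last '}'
    json_start = json_str.find("{")
    json_end = json_str.rfind("}")
    if json_start != -1 and json_end != -1:
        try:
            return json.loads(json_str[json_start:json_end + 1])
        except json.JSONDecodeError:
            return {}
    return {}

# Typical LLM output: JSON surrounded by explanatory prose
raw = 'Here is the summary: {"Positive Events": ["launch"], "Negative Events": ["recall"]} Hope this helps.'
print(extract_json_data(raw))

# A response with no JSON at all falls back to an empty dict
print(extract_json_data("no json here"))
```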
import nltk
# Download the 'punkt_tab' resource
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
token_counts = [len(word_tokenize(text)) for text in data_1['News']]
max_tokens = max(token_counts)
print(max_tokens)
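The token count printed above matters because the longest weekly digest, the instruction text, and the generation budget must all fit inside the model's context window (`n_ctx=5500` above). A rough sketch of the check, with hypothetical token counts; note that `word_tokenize` counts words, which only approximate the model's subword tokens, so a safety margin is prudent:

```python
# Rough sanity check against the Llama context window configured above
n_ctx = 5500            # context window size (from the Llama(...) call)
max_news_tokens = 3200  # hypothetical stand-in for the max_tokens printed above
prompt_tokens = 700     # rough size of Instruction_1 in tokens (hypothetical)

margin = n_ctx - (max_news_tokens + prompt_tokens)
print(f"Tokens left for generation: {margin}")
assert margin > 0, "Prompt + news would overflow the context window"
```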
Instruction_1 = """
Role: You are an expert financial analyst specializing in market sentiment analysis and news summarization. Your primary role is to analyze weekly news articles related to a specific company and determine the top three positive and negative events that are most likely to affect its stock price.
Task: Analyze the provided news articles for the past week and identify the top three positive and negative events. These events should be the most significant occurrences reported in the news that could potentially influence the company's stock price. Summarize these events concisely and objectively.
Instructions:
1. Carefully read each news article provided for the specified week.
2. Extract key events or topics discussed in the articles.
3. Categorize the events as positive or negative based on their potential impact on the company's stock price. For example, a new product launch would generally be considered a positive event, while a product recall would be considered a negative event.
4. Rank the positive and negative events based on their significance and potential impact.
5. Select the top three most impactful positive events and the top three most impactful negative events.
6. Summarize each selected event in a clear and concise manner, avoiding subjective opinions or interpretations. Focus on factual reporting and avoid speculation.
7. Present the summarized events in a JSON format with two keys: "Positive Events" and "Negative Events." Each key should contain a list of the three summarized events in order of impact.
Example JSON Output:
{
"Positive Events": [
"Company announced a strategic partnership with a major industry player, potentially expanding its market reach.",
"Positive earnings report exceeding analysts' expectations, indicating strong financial performance.",
"New product launch receiving positive reviews and generating significant customer interest."
],
"Negative Events": [
"Product recall due to safety concerns, impacting sales and brand reputation.",
"Regulatory investigation initiated against the company, potentially leading to fines or penalties.",
"Key executive unexpectedly resigned, raising concerns about leadership stability."
]
}
"""
#length of the instructions
len(Instruction_1)
#Defining the response function
def response_mistral(prompt, news):
model_output = llm(
f"""
[INST]
{prompt}
News Articles: {news}
[/INST]
""",
max_tokens=5500, # maximum number of tokens the model may generate for the summary
temperature=0, # deterministic output for reproducible summaries
top_p=0.95, # nucleus sampling threshold
top_k=50, # sample only from the 50 most likely tokens
echo=False,
)
final_output = model_output["choices"][0]["text"]
return final_output
Note: Use this section to test out the prompt with one instance before using it for the entire weekly data.
# Testing the prompt on a single instance: the first week's news
test_data = response_mistral(Instruction_1, data_1['News'][0])
print(test_data)
import pandas as pd
from IPython.display import display
# Assuming data_1 is your DataFrame and 'News' is the column containing news articles
# Set display options to show all text
pd.set_option("display.max_colwidth", None) # Display full column width
# Display the entire text of the first news article without truncation
display(data_1['News'][0])
data_1['model_response'] = data_1['News'].progress_apply(lambda x: response_mistral(Instruction_1, x))
Extract the JSON data
data_1['model_response_parsed'] = data_1['model_response'].apply(extract_json_data)
data_1['model_response_parsed']
Checking for empty dictionary
data_1[data_1["model_response_parsed"].apply(lambda d: d == {})]  # rows where JSON parsing failed
Example model parsed response
data_1['model_response_parsed'][4]
Create a dataframe from the JSON data
model_response_parsed_df = pd.json_normalize(data_1['model_response_parsed'])
model_response_parsed_df.head(2)
data_with_parsed_model_output = pd.concat([data_1, model_response_parsed_df], axis=1)
Remove square brackets from Positive and Negative Events
import re
def remove_brackets(text):
if isinstance(text, list): # Check if text is a list
text = ', '.join(text) if text else '' # Join list elements into a string
return re.sub(r'\[.*?\]', '', str(text)) # Convert text to string before applying re.sub
data_with_parsed_model_output['Positive Events'] = data_with_parsed_model_output['Positive Events'].apply(remove_brackets)
data_with_parsed_model_output['Negative Events'] = data_with_parsed_model_output['Negative Events'].apply(remove_brackets)
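A quick check of `remove_brackets` on both input types it must handle, re-implemented inline so the sketch runs standalone (the sample event strings are illustrative):

```python
import re

def remove_brackets(text):
    # Same behaviour as the helper above: join lists, then strip [...] spans
    if isinstance(text, list):
        text = ", ".join(text) if text else ""
    return re.sub(r"\[.*?\]", "", str(text))

# A parsed list of events joins into one comma-separated string
print(remove_brackets(["Product launch", "Strong earnings"]))

# Bracketed fragments inside a plain string are stripped
print(remove_brackets("Partnership announced [source: wire] last week"))
```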
data_with_parsed_model_output.head(2)
#remove model_response and model_response_parsed
final_summary = data_with_parsed_model_output.drop(['model_response', 'model_response_parsed'], axis=1)
Display the final summary (Date, News, Positive Events, and Negative Events)
final_summary.head(2)
Conclusion and Business Recommendation on Sentiment Analysis:
Conclusion and Business Recommendation on weekly news summarization:
-
Power Ahead